Machine Learning on Pollution from Transportation

In this notebook, we will guide you through using Pandas to process the emission data for TensorFlow machine learning. Then we will show you how to create and train your TensorFlow model. Answer the questions marked Q and follow the steps marked To-do. When you see a marker such as $^{D1}$ or $^{M1}$ next to a problem, refer to the rubric to see how that problem will be graded, as those problems are worth points.

Note: Click the "Run" button to run the notebook block by block. We do not recommend using "Run All" in the "Cell" menu, because the first few blocks only need to be run once and they take some time to run.

Import Libraries

The following block imports the necessary Python libraries. You might encounter an error when importing tensorflow. This is because TensorFlow is not a default library that comes with the Python distribution you installed. Go to https://www.tensorflow.org/install/pip#system-install and follow the instructions for installing TensorFlow. If the installation fails, you can add --user after pip install; this is usually needed when you have not created a virtual environment for your Python packages. You can follow Step 2 on the website to create a virtual environment (recommended), or you can install the package in your HOME environment. You might also encounter errors when importing other libraries. Install them with the same pip method described above.
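Before running the import block, it can help to check which libraries are missing. The following is a minimal sketch, not part of the original notebook: the list in REQUIRED is an assumption based on the libraries this notebook mentions, and missing_packages is a hypothetical helper name.

```python
import importlib.util

# Assumed list of libraries; adjust it to match the import block in your notebook.
REQUIRED = ["pandas", "seaborn", "tensorflow", "tensorboard"]


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All listed libraries are available.")
```

The check only looks the packages up and does not import them, so it runs quickly even when TensorFlow is installed.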

Import Tensorboard

Load and Clean up the Dataset

Load the Dataset

To process the data, save the .csv file you downloaded from Google Drive in the same directory as this notebook.

The following link explains the meaning of the columns in "emission.csv": https://sumo.dlr.de/docs/Simulation/Output/EmissionOutput.html
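As a rough illustration of the loading step, the sketch below reads a CSV with Pandas and drops incomplete rows. It rests on several assumptions: load_emissions is a hypothetical helper, dropping rows with missing values is only one possible clean-up step, and the inline data is a made-up stand-in for "emission.csv" with invented values and an assumed subset of column names.

```python
import io

import pandas as pd


def load_emissions(source, sep=","):
    """Read an emission CSV and drop rows that contain missing values."""
    df = pd.read_csv(source, sep=sep)
    return df.dropna().reset_index(drop=True)


# Made-up stand-in for "emission.csv"; the values are invented for illustration.
demo_csv = io.StringIO(
    "timestep_time,vehicle_CO2,vehicle_fuel,vehicle_speed\n"
    "0.0,2624.72,1.13,0.0\n"
    "1.0,,1.50,1.2\n"
    "2.0,3100.10,1.33,2.5\n"
)

emission = load_emissions(demo_csv)
print(emission.head())    # first rows of the cleaned table
print(emission.dtypes)    # column types inferred by Pandas
```

With the real file, the call would look like load_emissions("emission.csv"). If the file was exported with a different delimiter, such as ";", pass it through the sep argument.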

Visualize the Dataset

Below we use sns.pairplot() to show 2D plots of each pair of columns in the dataset. We plot only a small, randomly drawn sample of emission_train because plotting too much data might crash the program. .sample(frac=0.01), for example, randomly draws 1% of the rows of a DataFrame.
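The sampling step can be sketched as follows. The DataFrame here is synthetic stand-in data, not the real emission_train, and the random_state argument is an addition that makes the draw repeatable. plot_pairs is a hypothetical wrapper around sns.pairplot() that imports seaborn only when it is called.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for emission_train (1000 rows of invented values).
emission_train = pd.DataFrame({
    "vehicle_CO2": rng.uniform(0, 5000, size=1000),
    "vehicle_speed": rng.uniform(0, 30, size=1000),
})
# Made-up proportionality factor, used only to give the plot a linear pair.
emission_train["vehicle_fuel"] = emission_train["vehicle_CO2"] / 2300.0

# Draw 1% of the rows at random, without replacement.
sample = emission_train.sample(frac=0.01, random_state=0)
print(len(emission_train), "rows ->", len(sample), "sampled rows")


def plot_pairs(df):
    """Draw a pair plot of every numeric column against every other."""
    import seaborn as sns  # imported here so sampling works without seaborn
    return sns.pairplot(df)
```

Calling plot_pairs(sample) in the notebook draws the grid of scatter plots. Increasing frac gives denser plots at the cost of longer rendering time.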

From the pair plots you can see the relationships between the columns of the dataset. For example, vehicle_CO2 and vehicle_fuel have a linear relationship, while vehicle_CO2 and vehicle_pos have a parabolic or exponential-like relationship. Some columns might have a relationship that is not easily identified from pair plots.
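A numeric check can complement the visual impression from the pair plots. The sketch below computes Pearson correlation coefficients with DataFrame.corr() on synthetic data, in which vehicle_fuel is constructed to be exactly proportional to vehicle_CO2 and vehicle_angle is constructed to be independent of it. These constructions and values are assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
co2 = rng.uniform(0, 5000, size=2000)

# Synthetic columns: one exactly linear in CO2, one unrelated to it.
demo = pd.DataFrame({
    "vehicle_CO2": co2,
    "vehicle_fuel": co2 / 2300.0,                      # made-up factor
    "vehicle_angle": rng.uniform(0, 360, size=2000),   # independent noise
})

# Pearson correlation: close to 1 or -1 for linear pairs, near 0 for unrelated ones.
corr = demo.corr(method="pearson")
print(corr.round(2))
```

Pearson correlation measures only linear association, so a parabolic relationship can produce a coefficient near zero even when the pair plot shows a clear pattern. The plots and the coefficients are best read together.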

$^{D1}$Q: What do you find from the pair plots? Choose three pairs of columns and list what you observe from their pair plots.

Type your answers to Q:

  • vehicle_CO2 and vehicle_fuel are extremely correlated, so it's likely they are the same measurement (fuel efficiency?) in different units.
  • vehicle_CO2 and vehicle_noise appear to have a quadratic relationship, with the exception of some that are very low in CO2 but high in noise.
  • vehicle_CO2 and vehicle_angle do not appear to be correlated at all.